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Abstract 


Nonnegative  matrix  factorization  (NMF)  is  a  linear  powerful  technique  for  dimen¬ 
sion  reduction.  It  reduces  dimension  of  data  making  learning  algorithms  faster  and 
more  effective.  Although  NMF  and  its  applications  have  been  developed  for  more 
than  a  decade,  they  still  have  limitations  in  modeling  and  performance. 

In  this  study,  we  design  rich  models  and  fast  algorithms  for  NMF.  We  have  found 
solutions  for  the  following  four  problems: 

1.  Accelerated  parallel  and  distributed  algorithm  for  NMF  with  Frobenius  norm 
with  L\  L2  regularizations.  The  algorithm  is  based  on  the  proposed  accelerated 
anti-lopsided  algorithm  for  nonnegative  least  squares.  It  attained  fast  over¬ 
bounded  guaranteed  convergence  0([(1  —  £)(1  —  i)2r]k )  in  the  space  of  passive 
variables,  where  convex  parameter  n  and  Lipschitz  constant  L  are  bounded  as 
\  <  H  <  L  <  r. 

2.  Fast  parallel  randomized  algorithm  NMF  with  Kullback-Leiblcr  divergence  with 
L\  L2  regularizations.  The  proposed  algorithm  has  fast  convergence,  and  utilize 
the  sparse  properties  of  data,  model  and  representation.  In  addition,  the  exper¬ 
iments  indicate  that  sparse  models  and  sparse  representation  are  archived  for 
large  sparse  datasets,  which  is  a  significant  milestone  in  this  research  problem. 

3.  New  models  of  simplicial  NMF  and  simplicial  nonnegative  matrix  tri-factorization 
with  Frobenius  norm  and  fast  parallel  algorithm.  The  new  models  have  more 
concise  interpretability  via  these  values  of  factor  matrices.  The  proposed  al¬ 
gorithms  based  on  a  combination  of  alternating  direction  and  Frank- Wolfe’s 
scheme  attain  linear  convergence  0(1/ k),  low  iteration  complexity,  and  easily 
controlled  sparsity.  The  experiments  indicate  that  the  proposed  model  and 
algorithm  outperform  the  NMF  model  and  its  state-of-the-art  algorithms. 

4.  New  model  of  simplicial  NMF  with  Kullback-Leibler  divergence  and  fast  parallel 
algorithm.  The  proposed  algorithm  has  guaranteed  instance  inference  with  sub- 
linear  convergence  0(l/k),  and  easy  sparsity  control.  The  experiments  indicate 
that  this  approach  can  achieve  high  sparse  representation  with  higher  accuracy 
in  comparison  with  similar  approaches. 
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Figure  1:  Nonnegative  matrix  factorization 

1.  Introduction 

1.1  Nonnegative  matrix  factorization 

Nonnegative  matrix  factorization  (NMF)  is  a  powerful  technique  widely  used  in  appli¬ 
cations  of  data  mining,  signal  processing,  computer  vision,  bioinformatics,  etc.  Fun¬ 
damentally,  NMF  has  two  main  purposes.  First,  it  reduces  dimension  of  data  making 
learning  algorithms  faster  and  more  effective  as  they  often  work  less  effectively  due 
to  the  curse  of  dimensionality.  Second,  NMF  helps  extracting  latent  components  and 
learning  part-based  representation,  which  are  the  significant  distinction  from  other 
dimension  reduction  methods  such  as  Principal  Component  Analysis  (PCA),  Inde¬ 
pendent  Component  Analysis  (ICA),  Vector  Quantization  (VQ),  etc.  This  feature 
originates  from  transforming  data  into  lower  dimension  of  latent  components  and 
non-negativity  constraints. 

Mathematically,  nonnegative  matrix  factorization  (NMF),  see  Figure  1,  is  formu¬ 
lated  as  follows: 

Definition  1  Given  a  dataset  consisting  ofm  vectors  in  n  dimensions  V  =  [Vi,  Vi,  •••, 
Vm]  G  Kfxm,  where  each  vector  presents  a  data  instance.  NMF  seeks  to  approximately 
factorize  V  into  a  product  of  two  nonnegative  factorizing  matrices  WT  and  F,  where 
W  G  Rrxn  and  F  G  RrXm  are  coefficient  matrix  and  latent  component  matrix,  respec¬ 
tively,  V  ~  WTF. 

In  another  mathematical  way  of  understanding  NMF,  the  factorization  is  the 
transformation  of  vectors  in  dataset  V  to  a  new  space  of  vectors  in  WT  to  obtain  new 
representation  F.  Specifically,  determining  F  is  considered  the  projection  of  datasets 
into  a  new  space  of  latent  factors  WT .  WTF  is  the  result  of  projecting  back  into  the 
original  space  of  new  coefficients  F. 
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Figure  2:  A  collection  of  documents  represented  by  weighted  words  can  be  analyzed 
as  product  of  two  matrices,  one  is  latent  factors  represented  by  weighted  words  and 
the  other  is  new  representation  of  documents  by  linear  combination  of  topics 

Furthermore,  the  whole  of  information  in  the  original  space  can  not  be  kept  be¬ 
cause  m,  n  is  often  bigger  than  r.  To  retain  the  properties  of  V,  the  factorization  needs 
to  guarantee  having  the  least  information  loss  via  minimizing  an  objective  function 

n  m 

D(V\\WT F)  =  d(14,j|  [WTF]t])\  where  d(x\\y)  can  be  considered  a  function  to 

*= 1  j= 1 

measure  the  difference  between  x  and  y. 

In  practice,  many  functions  are  employed  depending  on  the  properties  of  datasets. 
For  examples,  Frobenius  norm  is  suitable  for  image  datasets,  KullbackLeibler  (KL) 
divergence  is  employed  for  count  sparse  datasets  such  as  documents,  and  Itakura- 
Saito  (IS)  divergence  is  more  suitble  for  decomposing  a  power  spectrogram  in  musical 
applications.  In  this  research,  we  focus  on  two  of  the  most  popular  divergences: 

n  m 

•  Frobenius  norm:  J(V\\WTF)  =  || V"  -  WTF \\\  =  £  EO 'b  “  W7 FiY 

i= 11 

n  m 

•  KL  divergence:  J{V\\WTF)  =  E  E(Vy  ^  +  W?Fj) 

i=  1  1  1  3 

Regularizations  are  added  to  improve  the  quality  of  the  factorization.  Specially, 
L\  regularization  (/3i||IF||i  +  /?2 1 1  1 1  i  )  is  used  in  the  objective  function  to  enhance 

sparsity,  and  L2  regularization  (ai||IF|||  +  a^H-FjH)  is  used  to  enhance  smoothness. 

1.2  Motivation  of  our  work 

Although  nonnegative  matrix  factorization  and  its  applications  have  been  developed 
for  more  than  a  decade,  there  remains  many  open  problems  for  further  studies.  Some 
of  the  major  challenges  can  be  enumerated  as  follows: 

Challenge  1:  Rich  models.  Nonnegative  matrix  factorization  is  an  ill-posed  inverse 
problem.  In  other  words,  many  local  optimal  solutions,  called  as  stationary  points, 
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exist.  Hence,  rich  models  are  needed  that  can  concisely  interpret  the  relationship  ob¬ 
served  data  and  latent  factors,  and  have  the  high  quality  of  sparsity  and  smoothness. 
Rich  models  will  specially  describe  latent  factors  and  new  coefficients  to  enhance  the 
factorization  results.  In  addition,  model-based  approach  using  a  concise  specification 
language  to  describe  NMF  provides  significant  advantages  in  creating  highly  tailored 
models  for  specific  scenarios,  rapidly  prototyping  and  easily  understanding  various 
models  [1], 

Challenge  2:  Fast  convergence  and  low  complexity.  In  practice,  multiple 
iterative  algorithms  like  EM  algorithms  requiring  a  number  of  iterations  are  employed 
for  NMF  because  NMF  is  often  an  ill-posed  inverse  problem  and  its  objective  function 
depends  on  separated  sets  of  variables  WT  and  F.  The  number  of  used  iterations 
and  the  iteration  complexity  mainly  influence  the  speed  of  algorithm  performance.  In 
addition,  NMF  is  a  non-convex  optimization  problem,  yet  finding  new  W  and  F  when 
fixing  one  of  them  is  convex.  Hence,  algorithms  having  fast  guaranteed  convergence 
and  low  complexity  are  preferred  rather  than  heuristic  unstable  ones. 

Challenge  3:  Parallel  and  distributed  computation.  The  development  of  tech¬ 
nology  and  science  has  been  generating  huge  data  matrices,  which  makes  it  infeasible 
to  factorize  these  matrices  by  a  single  core  or  a  computer.  Hence,  the  computation 
must  be  parallelized  and  distributed  to  reduce  the  running  time.  The  major  challenges 
are  how  to  decompose  the  computation  into  small  independent  computing  units,  deal 
with  the  limited  internal  memory  and  various  constraints  of  smoothness,  sparsity,  and 
orthogonality.  Furthermore,  emerging  new  computing  models  such  as  Hadoop  and 
Spark  leads  to  redesign  parallel  NMF  algorithms  for  distributed  computation. 
Challenge  4:  Sparsity  of  models  and  representation.  Sparsity  is  a  natural 
property  of  data,  which  can  be  measured  by  the  number  of  non-zeros  elements.  For 
example,  the  number  of  words  in  a  topic  and  the  number  of  topics  in  a  document 
should  be  small.  Moreover,  sparsity  leads  to  save  used  internal  memory,  enhance 
the  speed  of  algorithms,  and  reduce  over-fitting  problems.  In  nonnegative  matrix 
factorization,  algorithms  need  to  attain  sparse  models  WT ,  sparse  representation  in 
F,  and  sparse  computation  that  utilizes  the  sparsity  of  WT  and  F  to  increase  their 
speed. 

In  this  research,  we  propose  rich  models  and  fast  algorithms  for  nonnegative  ma¬ 
trix  factorization,  see  Figure  3.  Concerning  rich  models,  two  rich  models,  namely 
NMF  with  L\  L2  regularizations  and  simplicial  NMF  that  is  extended  from  [2],  are 
proposed  to  enhance  sparsity,  smoothness  and  interpretability.  In  comparison  with 
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Figure  3:  Scheme  of  our  algorithms  and  their  relations 

the  previous  models,  the  proposed  rich  models  are  more  general,  robust  and  inter¬ 
pretable  to  deal  with  various  data.  Regarding  to  fast  algorithms,  we  propose  four 
fast  parallel  algorithms  for  four  NMF  variants  of  two  rich  models  and  two  diver¬ 
gences.  Furthermore,  the  proposed  algorithms  are  carefully  designed  to  achieve  low 
complexity  and  fast  convergence  to  deal  with  big  data. 


2.  Accelerated  parallel  and  distributed  algorithms  for  NMF 
with  Frobenius  norm  with  L\  L2  regularizations 

In  the  NMF  with  Frobenius  norm,  the  quality  of  this  factorization  is  controlled  by 
this  objective  function: 


D(V\\WtF )  =  ||F  -  WTF\\l  +  ai\\W\\l  +  a2||F||!  +  AlltFlK  +  A,||F||i 

To  be  fully  parallel  and  distributed,  the  optimization  is  totally  decomposed  into 
non-negative  quadratic  programming  (NQP)  problems  with  the  size  of  r,  which  is  the 
smallest  in  comparison  with  other  methods: 


D(V\\WTF)  =  Y,D(Vj\\wTFi)  =  Y.D{.v‘\Fwi) 

j=l  i= 1 
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The  proposed  method  is  basically  based  on  iterative  multiplicative  update  accel¬ 
erated  algorithm  (Algorithm  1).  To  adapt  to  big  data,  we  adjust  Algorithm  1  to 
a  parallel  and  distributed  one  using  limited  internal  memory  (Algorithm  2).  This 
algorithm  only  requires  limited  internal  memory  of  FTF ,  WTW  and  FVT  with  the 
size  of  0(r(r  +  n  j)  in  computing  nodes,  where  n  <C  m  for  large  datasets,  and  n  is 
limited  by  the  dimension  of  data. 


Algorithm  1:  Iterative  Multiplicative  Update  Accelerated  Algorithm 


Input:  Data  matrix  V  =  i  €  7 Z\ 


rxn  . 
+  ? 


,nxm  and  r. 

Output:  Latent  components  W  G  lZrxn. 

1  begin 

2  Randomize  r  nonnegative  latent  components  W  G  R 

3  repeat 

E-step:  Fixing  W  to  find  F+  such  that  the  accelerated  condition  is 
satisfied; 

M-step:  Fixing  F  to  find  W+  such  that  the  accelerated  condition  is 
satisfied; 

6  I  until  Convergence  condition  is  satisfied; 


The  most  major  challenge  of  Algorithm  2  is  to  solve  a  large  number  of  non¬ 
negative  quadratic  programming  (NQP)  problems.  Hence,  Algorithm  2  employs 
accelerated  anti-lopsided  algorithm  (Algorithm  3)  to  swiftly  seek  the  approximate 
solutions.  Specifically,  Algorithm  3  contains  four  main  steps: 

•  Part  1.  Anti-lopsided  transformation  from  Line  4  to  Line  6:  the  variable  vector 
x  is  transformed  into  a  new  space  by  x  =  go{y)  as  an  inverse  function.  In  the 
new  space,  the  new  equivalent  objective  function  g(y)  =  f{(p{y))  has  = 
1,  Vi,  or  the  acceleration  of  each  variable  equals  1.  As  a  result,  the  role  of 
variables  become  more  balanced  because  the  shape  of  the  function  becomes 
more  spherical  because  |g|  =  1,  Vi,  and  g{y)  is  convex.  This  part  aims  to  make 
the  post-processing  parts  more  effective  because  it  can  implicitly  exploit  the 
second  derivative  information  ^4  =  1,  Vi  to  guarantee  that  y  and  L  are  always 
bounded  as  |  <  y  <  L  <  r. 

•  Part  2:  Exact  line  search  from  Line  11  to  Line  13:  this  part  optimizes  the 
objective  function  with  a  guarantee  of  over-bounded  convergence  rate  0((  1  — 
f)k )  where  |  <  y  <  L  <  r  over  the  space  of  passive  variables,  which  has  a 
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complexity  0(r2).  The  part  aims  to  reduce  the  objective  functions  exponentially 
and  precisely,  although  it  suffers  from  variable  scaling  problems  and  nonnegative 
constraints. 

•  Part  3.  Greedy  coordinate  descent  algorithm  from  Line  15  to  Line  18  and 
repeated  in  Line  26:  this  part  employs  greedy  coordinate  descent  using  Gauss- 
Southwell  rule  with  exact  optimization  to  rapidly  reduce  the  objective  function 
with  fast  convergence  0(1  —  -^)  for  each  update  [3,  4],  which  has  a  complexity 
of  0(r2).  The  part  aims  to  reduce  negative  affects  of  variable  scaling  problems 
and  nonnegative  constraints,  although  it  has  zig-zagging  problems  because  of 
optimizing  the  objective  function  over  each  single  variable.  Due  to  having  fast 
convergence  in  practice  and  reducing  negative  affects  of  variable  scaling  prob¬ 
lems  and  nonnegative  constraints,  this  part  is  repeated  one  more  time  after  Part 
4. 

•  Part  4.  Accelerated  search  from  Line  22  to  Line  25:  This  step  performs  a 
momentum  search  based  on  previous  changes  of  variables  in  Part  2  and  Part 
3,  which  has  a  low  complexity  of  0(r.nn(r ))  where  nn(r )  is  the  number  of 
negative  elements  in  (xk+i  —  a Ax),  see  Line  25  in  Algorithm  3.  This  part  relies 
on  the  global  information  of  two  distinct  points  to  escape  the  local  optimal 
information  issues  of  the  first  derivative  raised  by  the  function  complexity.  This 
part  originates  from  the  idea  that  if  the  function  is  optimized  from  xs  to  Xk  by 
the  exact  line  search  and  the  coordinate  descent  algorithm,  it  is  highly  possible 
that  the  function  value  will  be  reduced  along  the  vector  (xk  —  xs)  because  the 
NNLS  objective  function  is  convex  and  has  (super)  eclipse  sharp. 

Theorem  2  The  exact  line  search  in  Algorithm  3  linearly  converges  at  the  rate  of 
0(1  —  j)k  in  the  sub-space  of  passive  variables,  where  \<  p  <  L  <r,  r  is  the  dimen¬ 
sion  of  solutions  or  the  number  of  latent  factors,  and  r  is  the  number  of  iterations. 

Theorem  3  Algorithm  3  has  over-bounded  linear  convergence  rate  of  0([(  1  —  £)(1  — 
ff)2r]k)  in  the  sub-space  of  passive  variables,  where  \  <  p,  <  L  <  r,  (1  —  £)(1  —  ff)2r 
<  (1  —  y-)(l  —  2hh)2r ,  r  is  the  dimension  of  solutions  or  the  number  of  latent  factors, 
and  k  is  the  number  of  iterations. 

Theorem  4  The  complexity  of  each  iteration  in  Algorithm  2  using  Algorithm  3  to 
solve  NQP  problems  is  0((m  +  n)r 2  +  mnr  +  k(m  +  n)r2).  In  addition,  it  is  0((m  + 

8 


DISTRIBUTION  A.  Approved  for  public  release:  distribution  unlimited. 


Algorithm  2:  Parallel  and  Distributed  Algorithm 
Input:  Data  matrix  V  =  { V} }]L \  €  Wf"  and  r. 
Output:  Latent  components  W  G  7?.™. 

1  begin 

2  Randomize  r  nonnegative  latent  components  W  G  1Z 

3  repeat 

4  y  =  0  G  7Lrn  /*  y  =  PVT*/; 

5  H  =  0  G  7Lrr  /*  H  =  FFt  /*; 

6  Q  =  WWT  G  7Tr; 

7  max  Stop  =  0; 

8  /*Parallel  and  distributed*/; 

9  for  j  —  1  to  m  do 

10  /*call  Algorithm  3*/ ; 

n  Fj  ~  argmin  —  VjWTx)\ 

x£HryO 

12  y  =  y  +  FjVj- 

13  |_  H  =  H  +  FjFj- 

14  /*Parallel  and  distributed*/; 

is  for  %  —  1  to  n  do 

16  /*call  Algorithm  3*/ ; 

17  Wi  ~  argmin  (xT^x  —  Y^x ); 

is  until  Convergence  condition  is  satisfied ; 
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Algorithm  3:  Fast  Combinational  Algorithm  for  NQP 


Input:  H  G  'Rrxr  and  h  G  7 Zr  and  xq 
Output:  x  ~  argmin  -.xT IIx  +  h 1  x 

x>^0 

1  begin 

2  /*Having  a  variable  maxStop  =  0  for  each  thread  of  computation  */ ; 

3  /*Re-scaling  variables*/; 

4  Q  =  * =  >WT;g= 


diag(H)diag(H)T  5  diag(H) 1 

/*Solving  NQP:  minimizing/ (x)  =  \xTQx  +  qTx*/] 
x  =  x0.  *  \J diag(H ); 

V/  =  Qx  +  g; 

repeat 

a;s  =  xfc_i  and  V/s  =  V/  ; 

/*Exact  line  search  over  passive  variables 
P(x)  =  {i  |  Vif(x)  <  0  or  Vif(x)  >  0  and  [x]{  >  0}*/; 

V/=  V/;  and  V/[x  =  0  and  V/  >  0]  =  0; 

a  =  arg  min  f(xk  -  aS7  j)  = 

=  xfc_i  -  aV/;  V/fc  =  V/fc  -  aQVf-  Q[xk}-,  xk  =  [xk}  +  ] 
/*Greedy  coordinate  descent  algorithm*/ ; 
for  t—1  to  n  do 

p  =  argmax  |Vj/(xfc)|; 

i£P(x) 

A xp  =  max( 0,  [xfc]p  -  -  [xk\p; 

V/  V/  A  QpAxp,  [x/jjp  A  A  Xp, 

if  (|| /*||!  <  e||/o||!)  or  ( II  fk III  <  maxStop )  then 
|  break; 

/*Accelerated  search  carries  a  ’’momentum”  based  on  the  changes  of 
variables  in  exact  line  search  and  greedy  coordinate  descent  part*/ ; 
Ax  =  xs  —  xk  /*xs  and  V/s  are  assigned  in  Line  9*/; 

a  =  argmin  f(xk  -  a  Ax)  =  *£<?Ax  =  AAXv/f-v/P 

O' 

xk  =  xk  —  aAx] 

V /  =  V /  -  aQAa;  -  Q[xfc]_;  =  [xk]+\ 

Repeat  steps  in  the  part  of  greedy  coordinate  descent  algorithm; 
until  (\\fk\\l  <  e\\fo\\l)  or  {\\fk\\l  <  maxStop ); 
maxStop  =  max(maxStop,  ||/fc|||); 

return  ,  Xk 


f  diag(H) 
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Tabic  1:  Dataset  Information 


Data-sets 

m 

n 

r 

Maxlter 

Faces 

6977 

361 

60 

300 

Digits 

6.104 

784 

80 

300 

Tiny  Images 

5.104 

3,072 

100 

300 

Nytimes 

3.105 

102,660 

100,. ..,200 

300 

[«~MUR  ■  PrG  •  BIP  ■  AcS-^- FCD-^AcH  ■  Nev---Alo 
Faces  Digits  Tiny  Images 


Figure  4:  Objective  function  values  \\V  —  WT F 1 1| / 2  versus  CPU  seconds 


n)r2  +  rS(mn)  +  k{m-\-n)r2)  for  sparse  data,  where  S{mn )  is  the  number  of  non-zero 
elements  in  data  matrix  V . 


The  experimental  comparative  evaluation  is  done  on  four  datasets  (see  Table  1), 
and  the  proposed  algorithm  (Alo)  is  compared  seven  state-of-the-art  belonging  differ¬ 
ent  research  directions:  Multiplicative  Update  Rule(MUR),  Projected  Gradient  Meth- 
ods(PrG),  Block  Principal  Pivoting  method(BlP),  Fast  Active-set- like  method(AcS), 
Fast  Coordinate  Descent  methods  with  variable  selection(FCD),  Accelerated  Hierar¬ 
chical  Alternating  Least  Squares(AcH),  and  Nesterov’s  optimal  gradient  method(Nev). 

Figure  4  and  5  show  that  the  proposed  algorithm  converges  fastest  versus  both 
time  and  iteration.  Furthermore,  the  proposed  algorithm  attains  the  best  optimal 
values  in  two  large  datasets,  and  it  is  stable  when  it  takes  only  one  iteration  in  average 
k  to  satisfy  the  stopping  condition.  These  experiments  indicate  the  strength  and 
robustness  of  the  proposed  algorithm,  specially  for  accelerated  anti-lopsided  algorithm 
for  nonnegative  quadratic  programming  problems. 
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Faces 


Tiny  Images 


Figure  5:  Objective  function  values  \\V  —  WT F^\/2  versus  the  iteration  number 


Table  2: 

Optimal  Values  of  NMF  solvers 

Dataset 

MUR 

PrG 

B1P 

AcS 

FCD 

AcH 

Nev 

Alo 

Faces  (xlO8) 

3.142 

2.003 

1.975 

1.975 

1.983 

2.058 

2.003 

1.985 

Digits  (xl0lcr 

) 

4.659 

1.639 

1.641 

1.641 

1.644 

1.640 

1.646 

1.639 

Tiny  Images  ( 

X 

O 

o 

6.925 

3.483 

3.472 

3.472 

3.474 

3.476 

3.473 

3.467 

Table  3:  Average  of  Iteration  Number  k 


Dataset 

MUR 

PrG 

B1P 

AcS 

FCD 

AcH 

Nev 

Alo 

Faces 

1.00 

321.12 

1116.96 

102.09 

1.54 

1.11 

29.21 

1.01 

Digits 

1.00 

36.70 

12503.75 

305.94 

1.00 

1.05 

23.36 

1.00 

Tiny  Images 

1.00 

767.45 

12869.12 

1086.51 

1.38 

2.52 

29.32 

1.02 
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3.  Fast  parallel  randomized  algorithm  NMF  with 

Kullback-Leibler  divergence  with  L\  L2  regularizations 

In  the  NMF-KL  problem  with  L\  L2  regularizations,  the  quality  of  this  factorization 
is  controlled  by  the  objective  function  as  follows: 


n  in  t  t- 

D(V\\\VtF)  —  ^  log  yyYjjr-Vij+W?Fj)+ai\\W\\l+a2\\F\\l+pl\\W\\i+P2\\F\\l 

i=  1  j= 1  1  ^ 

We  employ  a  multiple  iterative  update  algorithm  like  EM  algorithm,  see  Algo- 
rithm  4,  because  D(V\\WT F)  is  a  non-convex  function  although  it  is  a  convex  func¬ 
tion  when  fixing  one  of  two  matrices  W  and  F.  This  algorithm  contain  a  while  loop 
containing  two  main  steps:  the  first  one  is  to  optimize  the  objective  function  by 
F  when  fixing  W;  and  the  another  one  is  to  optimize  the  objective  function  by  W 
when  fixing  F .  Furthermore,  in  this  algorithm,  we  need  to  minimize  Function  1,  the 
decomposed  elements  of  which  can  be  independently  optimized  in  Algorithm  4. 


Algorithm  4:  Iterative  multiplicative  update 


Input:  V  e  7 Z™xm,  r,  and  aq,  a2,  /A,  P2  >  0 
Output:  W  G  TZ’lxr,  F  e  7Thxm 

1  begin 

2  Randomize  W  G  lZr*n] 

3  Randomize  F  G  7Zrxm:: 

4  while  convergence  condition  is  not  satisfied  do 

5  ids  =  a  randomized  ordered  set  of  values  {1,  2, ...,  r}; 

6  sumW  =  Wln ; 

7  /*()ptimizing  the  objective  function  by  F  when  fixing  W* / ; 

8  for  j  =  1  to  m  do 

9  /*Call  Algorithm  5  in  parallel*/ ; 

10  Fj+1  =  Algorithm  5  (Vj,  WT ,  surnW,  Ff,  a2,  @2,  ids) 

11  sumF  =  Flm ; 

12  /*Optimizing  the  objective  function  by  W  when  hxing  F* / ; 

13  for  i  =  1  to  n  do 

14  /*Call  Algorithm  5  in  parallel*/ ; 

15  W^+l  =  Algorithm  5  (Vfi  FT ,  sumW,  IF/,  a2 ,  fi2,  ids); 


16  return  (lFfc+1)T,  Fk+1; 
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m 


n 


(i) 


D(V\\WtF)  =  YjD{V1\\WtF1)  =  Y,D{VT\\FfTWj) 

i=  1  3= 1 

Specially,  a  number  of  optimization  problems  D(Vi\\WT Fi)  or  D{yJ'\\FTWj)  in 
Algorithm  4  can  be  independently  and  simultaneously  solved  by  Algorithm  5  in  this 
form: 


f(x)  =  D(v\\Ax)  =  ^2(vi\og  -rj—  -Vi+  [Ax]i)  +  ^\\x\\l  +  (3\x\i 

i= 1  ‘l 


(2) 


where  v  e  TVf,  A  e  7^xr,  x  G 

From  Equation  2,  the  first  and  second  derivative  of  the  variable  x^  are  computed 
by  Formula  3: 


V/fc  —  —  Afc  +  +  P 


i=  1 


v2/^-  =E^ 


2=1 


A-jk  \  2 
[Ar]i  / 


2=1 


“h  CK 


(3) 


Algorithm  4  is  designed  on  randomized  coordinate  descent  algorithm  because  up¬ 
dating  one  element  in  x  requires  a  heavy  computation. 

In  theoretical  analysis  of  the  method  we  obtained  the  following  result: 


Theorem  5  The  complexity  of  Algorithm  5  is  0(nnnz(r)  +  fr(r+n  +  nnz(n))),  where 
nnzfr)  is  the  number  of  non-zero  elements  in  x,  nnzfn)  is  the  number  of  non-zero 
elements  in  v,  and  t  is  the  average  number  of  iterations.  Then,  the  complexity  of  a 
while  iteration  in  Algorithm  f  is  0(t(mnr  +  (m  +  n)r2)) 


The  method  is  experimentally  evaluated  on  the  four  datasets,  see  Table  4,  in 
terms  of  convergence,  sparsity  for  factor  matrices,  the  use  of  internal  memory.  The 
proposed  algorithm  (Sparse  Randomized  Coordinate  Descent (SRCD))  is  compared 
with  two  state-of-the-art  methods  Cycle  Coordinate  Descent  (CCD)  and  Multiplica¬ 
tive  Update(MU). 

Table  5  shows  that  our  method  uses  less  internal  memory  than  the  others.  For 
the  dense  dataset  Digits,  the  proposed  algorithm  SRCD  uses  more  internal  memory 
than  the  algorithm  CCD  because  a  considerable  amount  of  memory  is  used  for  the 
indexing  of  matrices: 

Figure  6  and  7  show  the  running  time  of  Algorithm  SRCD  for  100  iterations 
with  different  number  of  latent  component  using  1  thread.  Clearly,  the  running  time 
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Digits  with  r  =  10 


-  SRCD--  CCD^MU 


TDT2  with  r  =  10 


107 

io6-8 


E*h 

h 

^io6-6 


101  102  103 
seconds 


Figure  6:  Objective  value  D(V\\WT F)  versus  running  time  with  r  =  10 


--SRCD---CCD^MU 

Digits  with  r  =  20  TDT2  with  r  =  20 


Figure  7:  Objective  value  D(V\\WT F)  versus  running  time  with  r  =  20 
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Algorithm  5:  Randomized  coordinate  descent  algorithm  for  sparse  datasets 
Input:  v  E  7 Zn,  A  e  7 Znxk,  sumA ,  x  e  77fc,  a  >  0,  (3  >  0,  and  variable  order 
ids 

Output:  x  is  updated  by 

n 

x  ~  argmin  ^  ~ vi  logQAx]*  +  e)  +  [Ax]i  +  |  ||x|||  +  /3 ||o: || i . 

x>zO  i=  1 

1  begin 

2  Compute  Ax  e  77" ; 

3  for  k  in  order  ids  do 

4  Compute  V/fc  and  V2  fkk  based  on  a,/3,v,x,Ax,sumA  and  sparsity  of 

v  based  on  Formula  3; 

5  while  (V/fc  <  —  e)  or  (\^7 fk\  >  e  and  x^  >  e)  do 

6  Ax  =  max( 0,  Xfc  —  ^2fkk )  — 

7  Update  Ax  via  Ax  =  Ax  +  AxAk  /* Re-computing  Ar  is  removed*/; 

8  xs  =  xk] 

9  xk  =  xk  +  Ax ; 

10  if  (Ax  <  e^xs)  then 

n  break; 

12  Update  V/fc  and  V2 fkk  based  on  a ,  /?,  n,  x,  sumA  and  sparsity  of  v 

based  on  Formula  3; 

13  return  x; 


Table  4:  Summary  of  datasets 


Dataset  (V) 

n 

m 

nnz(V) 

Sparsity  (%) 

Digits 

784 

60,000 

8,994,156 

80.8798 

Reuters21578 

8,293 

18,933 

389,455 

99.7520 

TDT2 

9,394 

36,  771 

1,224,135 

99.6456 

RCVl_4Class 

9,625 

29,992 

730, 879 

99.7468 

linearly  increases,  which  fits  the  theoretical  analyses  about  fast  convergence  and  linear 
complexity  for  large  sparse  datasets. 

Concerning  the  parallel  algorithm,  the  running  time  of  Algorithm  SRCD  for  100 
iterations  significantly  decreases  when  the  number  of  used  threads  increases  in  Fig¬ 
ure  8  and  9.  Furthermore,  Table  9  indicates  that  highly  sparse  models  and  sparse 
representation  can  be  attained  for  large  sparse  datasets,  which  is  a  significant  mile¬ 
stone  in  researching  this  problem.  In  addition,  the  running  time  is  acceptable  for 
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Tabic  5:  Used  internal  memory  (GB)  for  r  =  10 


Datasets 

MU 

CCD 

SRCD 

Digits 

1.89 

0.85 

1.76 

Reuters21578 

5.88 

2.46 

0.17 

TDT2 

11.51 

5.29 

0.30 

RCVl_4Class 

9.73 

4.43 

0.23 

Digits  -  ♦-  Reuters21578  TDT2  -  RCVl 4Class 


Figure  8:  Running  time  of  100  iterations  with  different  number  of  latent  component 
using  1  thread 


large  applications.  Hence,  these  results  indicate  that  the  proposed  algorithm  SRCD 
is  feasible  for  large-scale  applications. 

The  proposed  algorithm  attains  the  fastest  convergence  by  means  of  removing 
the  re-computation  Ax  over  iteration,  exploiting  sparse  properties  of  data,  model 
and  representation  matrices,  and  utilizing  the  fast  accessing  speed  of  CPU  cache. 
In  addition,  our  method  can  stably  run  systems  within  limited  internal  memory  by 
reducing  internal  memory  requirements.  In  future  research,  we  will  generalize  this 
algorithm  for  nonnegative  tensor  factorization. 


Table  6:  Sparsity  (%)  of  ( W ,  F)  for  the  algorithm  SRCD’s  results 


Digits 

Reuters21578 

TDT2 

RCV1  4Class 

r  =  10  (74.3,  49.2) 

(75.6,  71.6) 

(68.5,  71.3) 

(81.2,  74.0) 

r  =  20  (87.8,  49.7) 

(84.2,  80.4) 

(78.6,  81.1) 

(88.4,  83.0) 
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-•-Digits--.-  Reuters21578-*-TDT2 - ■  -RCVl-4Class 


Figure  9:  Running  time  of  100  iterations  with  r  =  50  using  different  number  of 
threads 

4.  New  model  of  simplicial  NMF  with  Frobenius  norm  and 
fast  parallel  algorithm 

New  NMF  model,  simplicial  NMF,  is  proposed  by  adding  simplicial  constraints  over 

r 

coefficients  ^  Fkj  —  1  Vj  that  expresses  the  probabilistic  combination  role  of  latent 

k=  1 

components  over  data  instances.  The  proposed  fast  parallel  algorithm  is  derived  from 
Frank- Wolfe  algorithm  for  inferring  instances  has  major  advantages: 

•  Sub-linear  convergence:  The  proposed  algorithm  guarantees  sub-linear  con¬ 
vergence  0(1/ k)  in  the  simplex  of  nonnegative  variables. 

•  Low  iteration  complexity:  The  iteration  complexity  of  proposed  algorithm  is 
0(r )  instead  of  0(r 2)  which  is  the  iteration  complexity  of  advanced  guaranteed 
algorithm  [5]. 

Theorem  6  The  complexity  of  inferring  a  data  instance  is  0(kr),  and  the 
complexity  of  inference  step  is  O  (rrinr  +  nr2  +  kmr),  where  k  is  the  iteration 
number,  n  is  the  instance  dimension,  and  m  is  the  instance  number. 

•  Sparsity  control:  Sparsity  of  coefficients  is  easily  controlled  in  simplicial  NMF 
with  Frobenius  norm. 

Theorem  7  Let  f  be  a  twice  differentiable  convex  function  over  simplex  A 
and  denote  Cf  =  supy^A-,y£[y,z}(y  ~  z) .V2 / (y) .(y  —  z)T .  After  l  iterations,  the 
Frank-Wolfe  algorithm  will  find  an  approximate  solution  xi  with  at  most  (l  +  1) 
non-zeros  coefficients  which  satisfy 
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maxxeAf(xi )  -  f(x)  <  ^ 


Algorithm  6:  Iterative  multiplicative  update  for  Frobenius  norm 


Input:  Data  matrix  V  =  {Vj}f=1  G  and  r. 

Output:  Coefficients  F  G  IFfi m  and  latent  components  W  G  Rrfi 

1  begin 

2  Select  randomly  r  components  from  m  data  instances; 

3  repeat 

4  q  =  —WV; 

5  Q  =  WWT; 

6  Inference  step:  Fix  W  to  find  F  «  argmin  J(V\\WT F)] 

R^RrXTn 


7 

8 
9 


/*Call  Algorithm  7  in  parallel*/; 
for  j  =  1  to  m  do 

Fj  «  argmin  ||Vj  —  fFra;||2  =  argmin  +  qjx 

x  inR+,xTlr=l  x  inR+,xT lr=l 


10 

11 

12 

13 

14 

15 


q  =  -FVT- 
Q  =  FFt; 

Learning  step:  Fix  F  to  find  W  ~  argmin  J(DT||FTfF); 

u/e'R.rx™ 

/*Call  solving  NQP  in  parallel*/; 
for  i  =  1  to  n  do 

Wi  ~  argmin  ||f/T  —  FTx  \  \  \  =  argmin  ^xTQx  +  qjx 

x  inRr+  x  in  R1^ 


16 

17 

18 


if  convergence  condition  is  satisfied  then 
j  break; 

until  False ; 


Our  proposed  method  employs  a  multiple  iterative  update  algorithm,  see  Algo¬ 
rithm  6,  like  EM  algorithm  containing  two  main  step:  (1)  Inference  step  optimizes 
D(V\\WT F)  by  F  to  infer  new  coefficients  of  data  instances,  which  calls  Algorithm  7; 
(2)  Learning  step  optimizes  D(VT\\FTW)  by  W  to  learn  the  parameters  W  of  NMF 
model,  which  calls  a  nonnegative  quadratic  programming  solver. 

Algorithm  7  infers  the  coefficients  of  each  instance,  which  is  derived  from  Frank- 
Wolfe  algorithm  to  achieve  guaranteed  convergence  and  to  control  effectively  the 
sparsity  of  representation.  Specially,  the  iteration  complexity  can  be  reduced  into 
O(r)  instead  of  0(r2). 

We  investigate  the  effectiveness  of  our  approach  and  the  proposed  algorithm  for 
simplicial  NMF  (sNMF)  with  Frobenius  norm  by  four  criteria:  interpretability,  spar- 
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Algorithm  7:  Inference  for  data  instance  x 


Input:  Q  G  1Zrxr,  q  G  W . 

Output:  New  coefficient  x  ~  argmin  f(x)  =  argmin  | xTQx  +  gTa;. 

xe7er  a;S7er 

1  begin 

2  Choose  k  =  arg  min  \e\Qek  +  gTefc,  where  ek  is  the  kth  basis  vector; 

k 

3  Set  x  =  0fc;  Xk  =  1;  Qx  =  ga;  =  qTx  and  V/  =  Qx  +  qT ; 

4  repeat 

5  Select  A;  =  argmin  V fk, 

&E{l..r} 

6  Select  a  =  argmin  f(aek  +  (1  —  ot)x ); 

a 

7  a  =  min{  1,  max(— 1,  max(a ,  —  ))); 

8  if  a  ==  0  then 

9  break; 

lo  Qx  =  (1  —  a)Qx  +  aQek, 

n  V/  =  Qx  +  g; 

L2  ga;  =  (1  —  a)ga;  +  age^; 

L3  a;  =  (1  —  a)a;; 

L4  Xk  =  xk  +  a; 

L5  until  converged  conditions  are  staisfied ; 
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sity  of  solutions,  performance  in  classification  tasks  and  loss  information  measure  on 
4000  random-selected  samples  from  Digit  datasets.  Our  algorithm  (sNMF)  for  Frobe- 
nius  form  is  compared  to  NMF  [6],  spNMF  [7],  oNMF[8],  and  cNMF  [9].  The  results 
indicate  that  the  proposed  algorithms  can  receive  both  highly  sparse  representation, 
see  Figure  12  and  13;  and  high  accuracy  in  classification,  see  Figure  14;  while  other 
algorithms  can  only  achieve  one  of  them. 


Figure  12:  Latent  components  with  K  =  25  for  digit  database  (the  left  figure  for 
NMF,  and  the  right  figure  for  sNMF) 


5.  New  model  of  simplicial  nonnegative  matrix 

tri-factorization  with  Frobenius  norm  and  fast  parallel 
algorithm 

Based  on  the  previous  research,  existing  NMF  models  remain  the  limitations  in  the 
terms  of  interpretability,  guaranteed  convergence,  computational  complexity,  and 
sparse  representation.  Continuously,  we  propose  to  add  simplicial  constraints  to  the 
classical  NMF  model  and  to  reformulate  it  into  a  new  model  called  simplicial  nonneg¬ 
ative  matrix  tri-factorization  to  have  more  concise  interpretability  via  these  values 
of  factor  matrices.  Then,  we  propose  an  effective  algorithm  based  on  a  combination 
of  three-block  alternating  direction  and  Frank- Wolfe’s  scheme  to  attain  linear  con¬ 
vergence,  low  iteration  complexity,  and  easily  controlled  sparsity.  The  experiments 
indicate  that  the  proposed  model  and  algorithm  outperform  the  NMF  model  and  its 
st  ate-of- 1  he-  art  algor  it  hms . 

Concerning  the  interpretability,  NMF  is  a  non-convex  problem  having  numerous 
solutions  as  stationary  points,  which  has  rotational  ambiguities  [10].  Particularly,  if 
V  ~  WTF  is  a  solution,  V  ~  W 1  [DfF'\  =  [ WtDf\F '  are  also  equivalent  solutions; 
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sNMF  - .  -  spNMF  •  cNMF  -  ■  -  oNMF  -  •  NMF 


Figure  13:  Sparsity  of  new  coefficients  for  Frobenius  norm  with  r 


sNMF  - .  -  spNMF  •  cNMF  -  ■■  -  oNMF  - .  NMF 
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Number  of  latent  components 


Figure  14:  Inaccuracy  for  Digit  Classification 
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where  Dp  is  a  positive  diagonal  matrix  satisfying  F  =  DpF' .  Hence,  it  does  not  con¬ 
sistently  explain  the  roles  of  latent  components  over  instances  and  the  contributions 
of  attributes  over  latent  components. 

To  resolve  the  limitation,  we  propose  a  new  formulation  as  SNMTF,  in  which  the 
data  matrix  is  factorized  into  a  product  of  three  matrices  V  ~  WT DF ,  where  D  is  a 

r  r 

positive  diagonal  matrix,  =  lVi,  and  Fkj  =  1  Vj. 

k= 1  /c=l 

However,  the  scaling  of  factors  via  the  diagonal  matrix  D  can  lead  to  the  in¬ 
consistency  of  interpreting  the  factor  matrices  W  and  F,  and  the  existence  of  bad 
stationary  points  such  as  A*,  =  0  =>■  Wkl  =  0,  F^j  =  0.  Hence,  we  restrict  the  for¬ 
mulation  by  adding  the  condition  X1  =  ...  =  Ar  =  A.  This  model  can  be  considered 
as  an  extension  of  the  probabilistic  latent  semantic  indexing  (PLSI)  [11]  for  scaling 
data  with  an  additional  assumption  that  the  weights  of  latent  factors  are  the  same. 
Remarkably,  their  roles  of  latent  components  over  instances  and  of  attributes  over 
latent  components  are  represented  via  the  values  of  the  factor  matrices  W  and  F. 
As  a  result,  it  is  easier  to  recognize  these  roles  than  in  the  case  that  the  weights  of 
latent  factors  can  be  different.  Therefore,  the  objective  function  of  SNMTF  with  L2 
regularizations  is  written  as  follows: 

'  nun  ^W,D,F):^l\\V-WTDFrF  +  f\\W\\l  +  f\\FrF} 

<  s.t.  Wki  =  1  (i  =  1,  •  •  •  ,  n),  E  Fkj  =  1  (j  =  1,  •  •  •  ,  m),  (4) 

W  e  M;xn,  F  e  M;xm,  D  =  diag(A,  •  •  •  ,  A). 

There  are  three  significant  remarks  in  this  formulation.  Firstly,  adding  simplicial 
constraints  leads  to  more  concise  interpretability  of  the  factor  matrices  and  conve¬ 
nience  for  post  processes  such  as  neural  network  and  support  vector  machine  because 
the  sum  of  attributes  is  normalized  to  1.  Secondly,  Li  regularization  is  ignored  be¬ 
cause  ||W||i  and  || -F|| !  equal  to  a  constant.  Finally,  the  diagonal  of  D  —  diag(A, ...,  A) 
has  the  same  value  because  of  two  main  reasons:  First,  it  is  assumed  that  Vv]  is  gener¬ 
ated  by  Wt ,  Fj  and  a  scale  as  VtJ  ^  A W^Fj;  Second,  it  still  retains  the  generalization 
of  SNMTF  which  every  solution  of  NMF  can  be  equivalently  represented  by  SNMTF. 

We  note  that  problem  (4)  is  nonconvex  due  to  the  product  WT DF.  We  first  pro¬ 
pose  to  use  three-block  alternating  direction  method  to  decouple  there  three  blocks. 
Then,  we  decompose  the  computation  onto  column  of  the  matrix  variables,  which  can 
be  conducted  in  parallel.  Finally,  we  apply  Frank- Wolfe’s  algorithm  [12]  to  solve  the 
underlying  subproblems. 
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We  decouple  the  tri-product  WT DF  by  the  following  alternating  direction  scheme, 
which  is  also  called  iterative  multiplicative  update: 

{Ft+i  :=  arg  min  ($(pW5  fit  F\  .  F  e  A"}, 

FeKrxm  L  J 

Wt+1  :=  arg  min  { $(W,  D\  Ft+1 )  :  W  G  A™}, 

WeR’’xn  W 

ZT+1  :=  argmin  D,  Ft+1)  :  D  =  diag(A,  •  •  •  ,  A)}. 


Clearly,  both  F-problem  and  W-problem  in  (5)  are  convex  but  are  still  constrained 
by  simplex,  while  the  .D-problcm  is  unconstrained.  We  now  can  solve  the  H-problem 
in  the  third  line  of  (5).  Dne  to  the  constraint  D  =  diag(A,  •  •  •  ,  A),  this  problem  turns 
out  to  be  an  univariate  convex  program,  which  can  be  solved  in  a  closed  form  as 
follows: 


Am  =  argmin  \\V  -  Wt+lT DFm \\\ 
AeiR 


(V,  Wt+lTFt+1) 

( Wt+1Wt+lT ,  Ft+1  Ft+lT) ' 


(6) 


Then,  we  form  the  matrix  Dt+1  as  Dm  =  diag(Am,  •  •  •  ,  Xt+i). 

If  we  look  at  the  F-problem,  then  fortunately,  it  can  be  separated  into  n  indepen¬ 
dent  subproblems  (j  —  1,  •  •  •  ,  m)  of  the  form: 


F r 


arg  mm 


(Wt)TDtFj  ||1  + 


:  Fj  G  A 


(7) 


The  same  trick  is  applied  to  the  W-problem  in  the  second  line  of  (5).  Now,  we 
assume  that  we  apply  the  well-known  Frank- Wolfe  algorithm  to  solve  (7),  then  we 
can  describe  the  full  algorithm  for  solving  (4)  into  Algorithm  8. 

The  stopping  criterion  of  Algorithm  8  remains  unspecified.  Theoretically,  we  can 
terminate  Algorithm  8  using  the  optimality  condition  of  (4).  However,  computing 
this  condition  requires  a  high  computational  effort.  We  instead  terminate  Algorithm 
8  if  it  does  not  significantly  improve  the  objective  value  of  (4)  or  the  differences 
Wt+ 1  —  Wt  and  Fm  —  Ft  and  the  maximum  number  of  iterations. 

Principally,  we  can  apply  any  convex  optimization  method  such  as  interior-point, 
active-set,  projected  gradient  and  fast  gradient  method  to  solve  QP  problems  of  the 
form  (7).  However,  this  QP  problem  (7)  has  special  structure  and  is  often  sparse.  In 
order  to  exploit  its  sparsity,  we  propose  to  use  a  Frank- Wolfe  algorithm  studied  in 
[13]  to  solve  this  QP  problem.  Clearly,  we  can  write  (7)  as  follows: 


x  pa  argmin 

xGAr 


1 

2 


v  —  ATx 


2 

2 


+ 
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IgA  r  2 


(8) 
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Algorithm  8:  Iterative  multiplicative  update  for  Frobenius  norm 
Input:  Data  matrix  V  =  {1^}™=1  G  M+Xm,  and  r,  ai,a2  >  0,/3  >  0. 
Output:  Coefficients  F  G  Rr_fim  and  latent  components  W  G  Rrfin 

1  begin 

2  Pick  an  arbitrary  initial  point  W  G  M+Xn  (e.g.,  random); 

3  repeat 

4  q  =  -AW;  Q  =  A 2WWT  +  ml; 

5  /^Inference  step:  Fix  W  and  D  to  find  new  F; 

6  for  j  =  1  to  m  do 

7  Fj  ~  argmin  ||xTQx  +  gJx}/*Call  Algorithm  9  in  parallel  */; 

xEAr 

8  g  =  —XFVT]  Q  =  X2FFt  +  a2I; 

9  /*Learning  step:  Fix  F  and  D  to  find  new  W; 

10  for  i  =  1  to  n  do 

n  Wi  ~  argmin  {^xTQx  +  qf x} /* Call  Algorithm  9  in  parallel  */; 

xGAr 

12  A  =  argmin  ||C  -  VFTDF|||  =  ; 

ask  ’ 

13  until  convergence  condition  is  satisfied ; 


Algorithm  9:  Fast  Algorithm  for  NQP  with  Simplicial  Constraint 
Input:  Q  G  Mrxr,  q  G  Mr. 


Output:  New  coefficient  x  «  argmin  /(x)  =  ^xTQx  +  gTx. 

xEAr 

1  begin 

2  Choose  k  =  argmin  ^ejQek  +  qTek,  where  is  the  basis  vector; 

k 

3  Set  x  =  0k ;  Xfc  =  1;  Qx  =  Qe*, ;  qx  =  gTx  and  V/  =  Qx  +  gr; 

4  repeat 

5  Select  k  =  argmin  {(e*,  —  x,  V/)}  or  {(x  —  e*,,  V/)|x*;  >  0}; 

fcS{l..r} 

6  Select  a  =  argmin  f(aek  +  (1  —  a)x); 

a 

7  a  =  mm(l,roax(a,'-Tr-)); 

i  X]$ 

8  Qx  =  (1  —  a)Qx  +  aQek]  V/  =  Qx  +  q; 

9  qx  —  (1  —  a)gx  +  age*,; 

10  x  =  (1  —  a)x;  Xfe  =  Xfc  +  a; 

n  until  converged  conditions  are  staisfied ; 

where  v  —  Vj,  A  =  DW ,  Q  =  AAT  +  a,  and  q  =  —Av.  By  applying  the  Frank- Wolfe 
algorithm  form  [13]  to  solve  this  problem,  we  obtain  Algorithm  9  below. 
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In  Algorithm  9,  the  first  derivative  of  f{x)  =  \xTQx  +  qTx  is  computed  by  V/  = 
Qx  +  q.  In  addition,  the  steepest  direction  in  the  simplex  is  selected  by  this  formula: 
k  =  argmin  {(efc  -  x,  V/)}  or  {(x  -  ek,  V/) \xk  >  0}. 

fe€{l..r} 

For  seeking  the  best  variable  a  to  minimize  f(ax  +  (1  —  ce)ek)  where  ek  is  the  kth 
unit  vector.  Let  consider  f(ax  +  (1  —  a)ek),  we  have: 


Qf 

-^(a  =  0)  =  (x  -  ek)T(Qx  +  q)  =  xT(Qx  +  q)  -  [Qx\k  -  qk 

d 2  f 

=  °)  =  (x  ~  ek)TQ(x  -  ek)  =  xTQx  -  2 [Qx\k  +  Qkk. 

Since  /  is  a  quadratic  function  of  a,  its  optimal  solution  is  a  = 


(9) 


argmin 


/(( 1  —  a)x  +  aek)  =  [— *k  The  projection  of  solution  over  the  interval 
[— 1]  is  to  guarantee  xk  >  0  \/k.  The  updates  x  —  (1  —  a)x  and  xk  —  xk  +  a  are 
to  retain  the  simplicial  constraint  x  G  Ar. 

In  Algorithm  9,  duplicated  computation  is  removed  to  reduce  the  iteration  com¬ 
plexity  into  O(r)  by  maintaining  Qx  and  qTx.  This  result  is  highly  competitive 
with  the  state-of-the-art  algorithm  having  a  sub- linear  convergence  rate  0{l/k2)  and 
complexity  of  0(r'2)  [5]. 

This  section  discusses  three  important  aspects  of  the  proposed  algorithm  as  conver¬ 
gence,  complexity,  and  generalization.  Concerning  the  convergence,  setting  ai,oi2  > 
0,  based  on  Theorem  3  in  Lacoste-Julien,  S.,  &  Jaggi,  M.  (2013)  [13],  since  f(x)  = 
| xTQx  +  qTx  is  smoothness  and  strongly  convex,  we  have: 


Theorem  8  Algorithm  9  linearly  converges  as  f(xk+i)  —  f(x*)  <  (1  —p^w)k  (f(x o)  — 

fiFW 

f(x*)),  where  pjW  =  {|,  Cf  is  the  curvature  constant  of  the  convex  and  dif¬ 

ferentiable  function  f,  and  HjW  is  an  affine  invariant  of  strong  convex  parameter. 


Since  Algorithm  9  always  linearly  converges  and  the  objective  function  restrictedly 
decreases,  Algorithm  6  always  converges  stationary  points.  Regarding  the  complexity 
of  the  proposed  algorithm,  we  have: 


Theorem  9  The  complexity  of  Algorithm  9  is  0(r2  +  tr),  and  the  complexity  of  each 
iteration  in  Algorithm  6  is  0(mnr  +  (m  +  n)r2  +  t(m  +  n)r). 

Proof  The  complexity  of  the  initial  computation  Qx  +  q  is  0(r2).  and  each  iter¬ 
ation  in  Algorithm  9  is  0(r).  Hence,  the  complexity  of  Algorithm  9  is  0(r2  +  tr). 
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The  complexity  of  main  operators  in  Algorithm  6  as  WV,  FVT,  FF 7 ,  and  WWT 
is  0{mnr  +  (m  +  n)r2).  Overall,  the  complexity  of  each  iteration  in  Algorithm  6  is 
0(mnr  +  (m  +  n)r 2  +  t(m  +  n)r).  ■ 


This  is  highly  competitive  with  the  guaranteed  algorithm  [5]  having  the  complexity 
of  0(mnr  +  (m  +  n)r2  +  t(rn  +  n)r2).  Furthermore,  adding  the  simplicial  constants 
into  NMF  does  not  reduce  the  generalization  and  flexibility  of  NMF : 


Theorem  10  Each  solution  of  NMF  can  be  equivalently  transfered  into  SNMTF. 


Proof  Assume  that  V  ~  WTF ,  which  leads  to  the  existence  of  A  large  enough  to  sat¬ 
isfy  WTF  =  W'tD'F' ,  where  D'  =  diag(A, A),  Wki  <  1  V/,  and  Yh  Fkj  <  1  Vj. 


k= 1  k= 1 

n/T  n"  Z7"  _  t m’T  rv  t?'  _  tt/T’  i 


Therefore,  3 W",  D ",  and  F":  W"TD"F"  =  W'TD'F'  =  WTF ,  where  W”  =  W'i} 


r+ 1 


iji 


Ki  =  K,,  w-'"+M  =  o,w;'+v  =  i  -  E  wi!i  «,  f?+1j  =  i  -  e  =  o  vj 

^ — 1  k:==  1 

,D"  =  diag(A, A).  W"TD"F "  is  SNMTF  of  V.  U 


The  generalization  is  crucial  to  indicate  the  robustness  and  high  flexibility  of  the 
proposed  model  in  comparison  with  NMF  models,  although  many  constraints  have 
been  added  to  enhance  the  quality  and  interpretability  of  the  NMF  model. 

The  proposed  algorithm  SNMTF  is  compared  with  the  following  methods: 

•  NeNMF  [5]:  It  is  a  guaranteed  method,  each  alternative  step  of  which  sublin- 
early  converges  at  0{l/k2)  that  is  highly  competitive  with  the  proposed  algo¬ 
rithm. 

•  LeeNMF  [14]:  It  is  the  original  gradient  algorithm  for  NMF. 

•  PC  A:  It  is  considered  as  a  based-line  method  in  dimensionality  reduction,  which 
is  compared  in  classification  and  sparsity. 

Datasets:  We  compared  the  selected  methods  in  three  typical  datasets  with 
different  size,  namely  Faces1,  Digits2  and  Tiny  Images  3. 

Environment  settings:  We  develop  the  proposed  algorithm  SNMTF  in  Matlab 
with  embedded  code  C++  to  compare  them  with  other  algorithms.  We  set  system 

1.  http : //cbcl . mit . edu/ cbcl/ software- dataset s/FaceData. html 

2.  http : //yann . lecun. com/ exdb/mnist/ 

3.  http : //horat io . cs . nyu . edu/mit/t iny/ data/ index . html 
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Tabic  7:  Dataset  Information 


Datasets 

n 

m 

testing  size 

r 

#class 

Faces 

361 

6,977 

24,045 

30 

2 

Digits 

784 

6.104 

104 

60 

10 

Cifar-10 

3,072 

5.104 

104 

90 

10 

Figure  15:  Convergence  of  loss  information  f\jfk  versus  time 


parameters  to  use  only  six  threads  in  the  machine  Mac  Pro  6-Core  Intel  Xeon  E5  3 
GHz  32GB.  The  initial  matrices  W°  and  F°  are  set  to  the  same  values,  the  maximum 
number  of  iterations  is  500.  The  source  code  is  published  on  our  homepage  4. 

We  investigate  the  convergence  of  the  compared  algorithms  by  ^  because  they 
have  the  different  formulations  and  objective  functions.  Fig.  15  clearly  shows  that 
the  proposed  algorithm  converges  much  faster  than  the  other  algorithms.  The  most 
steepest  line  of  the  proposed  algorithm  represents  its  fast  convergence.  This  result  is 
reasonable  because  the  proposed  algorithm  has  a  faster  convergence  rate  and  lower 
complexity  than  the  state-of-the-art  algorithm  NeNMF. 

Concerning  the  classification  performance,  the  training  datasets  with  labels  are 
used  to  learn  gradient  boosting  classifiers  [15,  16],  one  of  the  robust  ensemble  meth¬ 
ods,  to  classify  the  testing  datasets.  The  proposed  algorithm  outperforms  the  other 
algorithms  and  PCA  over  all  the  datasets.  For  the  small  and  easy  dataset  Face,  the 
result  of  the  proposed  algorithm  is  close  to  the  results  of  NeNMF.  However,  for  larger 
and  more  complex  datasets  Digit  and  Tiny  Images,  the  proposed  algorithm  has  much 
better  accuracy  than  the  other  algorithms.  Noticeably,  the  result  of  Tiny  Images  is 
much  worse  than  the  result  of  the  other  datasets  because  it  is  highly  complicated  and 
contains  backgrounds.  This  classification  result  obviously  represents  the  effectiveness 
of  the  proposed  model  and  algorithm. 


4.  http://klmongnd.appspot.com/ 
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Table  8:  Classification  inaccuracy  (%) 


Dataset 

PCA 

LeeNMF 

NeNMF 

SNMTF 

Faces 

6.7 

3.3 

2.7 

2.6 

Digits 

30.15 

12.16 

3.9 

3.6 

Tiny  Images 

59.3 

71.4 

51.4 

50.1 

Tabic  9:  Sparsity  of  factor  matrices  (%)  of  F,  W,  and  (both  F  and  W) 


Dataset 

PCA 

LeeNMF 

NeNMF 

SNMTF 

Faces 

Digits 

Tiny  Images 

(0,  0.26,  0.24) 
(1.37,  0,  0.02) 
(0,  0,  0) 

(0.58,  0,  0.55) 
(31.24,  50.47,  31.49) 
(0.02,  0,  0.02) 

(7.64,  60.33  ,  10.23) 
(41.20,  92.49,  41.86) 
(9.97,  86.58  ,  14.40) 

(9.78 , 59.98,  12.24  ) 
(50.49 , 93.47 , 51.04  ) 
(11.71 , 85.29,  15.97  ) 

We  investigate  the  sparsity  of  factor  matrices  F,  W,  and  the  sparsity  average 
in  both  F  and  W.  For  the  dataset  Digit,  the  proposed  algorithm  outperforms  in 
all  these  measures.  For  the  other  datasets  Faces  and  Tiny  Images,  it  has  better 
representation  F  and  more  balance  between  sparsity  of  F  and  W.  Frankly  speaking, 
in  these  datasets,  achieving  more  sparse  representation  F  is  more  meaningful  than 
achieving  more  sparse  model  W  because  F  is  quite  dense  but  W  is  highly  sparse, 
which  is  a  reason  to  explain  why  SNMTF  has  the  best  classification  result. 

This  research  proposes  a  new  model  of  NMF  as  SNMTF  with  L-2  regularizations, 
which  has  more  concise  interpretability  of  the  role  of  latent  components  over  instances 
and  attributes  over  latent  components  while  keeping  the  generalization  in  comparison 
with  NMF.  We  design  a  fast  parallel  algorithm  with  guaranteed  convergence,  low 
iteration  complexity,  and  easily  controlled  sparsity  to  learn  SNMTF,  which  is  derived 
from  Frank- Wolfe  algorithm  [13].  Furthermore,  the  proposed  algorithm  is  convenient 
to  parallelize  and  distribute,  and  control  sparsity  of  both  new  representation  F  and 
model  W.  Based  on  the  experiments,  the  new  model  and  the  proposed  algorithm 
outperform  the  NMF  model  and  its  state-of-the-art  algorithms  in  three  significant 
aspects  of  convergence,  classification,  and  sparsity.  Therefore,  we  strongly  believe 
that  SNMTF  is  highly  potential  for  many  applications  and  extensible  for  nonnegative 
tensor  factorization. 
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6.  New  model  of  simplicial  NMF  with  Kullback-Leibler 
divergence  and  fast  parallel  algorithm 

This  study  extends  simplicial  nonnegative  matrix  factorization  for  KL  divergence 
by  proposing  a  fast  parallel  algorithm  having  guaranteed  sub-linear  convergence  in 
inference  for  large  sparse  datasets  to  archive  sparse  representation.  The  proposed 
algorithm  has  significant  properties  as  follows: 

•  Sub-linear  convergence:  The  proposed  algorithm  guarantees  sub-linear  con¬ 
vergence  in  the  simplex  of  nonnegative  variables.  Remarkably,  no  existing  al¬ 
gorithm  for  NMF  with  KL  divergence  has  this  convergence  guarantee. 

•  Low  complexity:  The  iteration  complexity  of  proposed  algorithm  is  0(k[r:nnz 
( n )  +  n log-])  which  is  feasible  for  large  sparse  datasets.  Furthermore,  the 
computation  for  learning  step  is  extremely  fast  with  the  iteration  complexity 
because  of  the  data  sparsity  (■ nnz(n )  <C  n) . 

Theorem  11  Consider  Algorithm  11  to  infer  a  data  instance  having  n-dimension 
by  r  latent  components  with  k  iterations.  Then,  its  complexity  is  0(k[r.nnz(n)  + 
n  log-]),  where  nnz(n)  is  the  number  of  non-zero  elements  in  v. 

•  Sparsity  control:  Sparsity  is  easily  controlled  for  simplicial  NMF  with  KL 
divergence  because  the  proposed  algorithm  is  derived  from  Frank- Wolfe  algo¬ 
rithm. 

Our  proposed  method  employs  a  multiple  iterative  update  algorithm,  see  Algo¬ 
rithm  10,  like  EM  algorithm  containing  two  main  step:  (1)  Inference  step  optimizes 
D{V\\WT F)  by  F,  which  calls  Algorithm  11;  (2)  Learning  step  optimizes  D(V\\WT F) 
by  W,  the  solution  of  which  can  be  directly  approximated. 

Algorithm  11  infers  the  coefficients  of  each  instance,  which  is  derived  from  Frank- 
Wolfe  algorithm  to  achieve  guaranteed  convergence  and  to  control  effectively  the 
sparsity  of  representation. 

We  investigate  the  effectiveness  of  our  approach  and  the  proposed  algorithm  for 
simplicial  NMF  with  KL  divergence  by  four  criteria:  sparsity  of  solutions,  performance 
in  classification  tasks  and  loss  information  measure.  The  proposed  algorithm  for 
simplicial  NMF  with  KL  divergence  (sNMF)  is  is  compared  to  kl-NMF[6],  local  non¬ 
negative  matrix  factorization  (locNMF)[17],  convolutional  NMF(conNMF)  [18],  and 
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Figure  16:  Sparsity  of  new  coefficients  for  KL  divergence  with  r  =  30 
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Figure  17:  Inaccuracy  for  Spam  Classification 
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Algorithm  10:  Iterative  multiplicative  update  for  KL  divergence 


Input:  Data  matrix  V  =  {Vj}'fi=1  6  Rrfi<rn  and  r. 

Output:  Coefficients  F  e  IFfi m  and  latent  components  W  G  -R+ 

1  begin 

2  Select  randomly  r  components  from  m  data  instances; 

3  repeat 

Inference  step:  Fix  W  to  find  F  «  argmin  J(V\\WT F); 

F'(zRr  X  rri 


5 

6 

7 

8 


Compute  swmfF  =  fFln; 

/*Call  Algorithm  11  in  parallel*/; 
for  j  =  1  to  m  do 

Fj  ~  argmin  -DxL(kj||kFT:c)  with  sumW ; 

a:  mi?y,xTlr=l 


9 

10 

11 

12 


Learning  step:  Fix  F  to  find  W  ~  argmin  J(VT\\FTW)', 

weizrxn 

/*Learning  new  latent  components  by  approximation  algorithm*/; 
for  i  =  1  to  n  do 

sr^rn  y 

Wi  «  argmin  DKL(Vfi\\FTx)  => 

x  inRr+  Ai=i  kl 


13 

14 

15 


if  convergence  condition  is  satisfied  then 
j  break; 

until  False ; 


Nonsmooth  Nonnegative  Matrix  Factorization  (nsNMF)[19]  on  4327  labeled  spam 
emails  with  32906  distinct  terms. 


sNMF  -  -*  -  spNMF  cNMF  -  ■  -  oNMF  -  -t-  kl-NMF 


Figure  18:  Information  Loss  for  KL  divergence  with  r  =  30 
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The  experimental  results  indicate  that  the  proposed  algorithm  can  achieve  highest 
sparse  representation  (see  Figure  16),  highly  competitive  accuracy  in  classification  (see 
Figure  IT),  and  the  fastest  convergence  versus  iteration  (see  Figure  18). 
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